31 research outputs found

    A Novel Web Scraping Approach Using the Additional Information Obtained from Web Pages

    Get PDF
    Web scraping is the process of extracting valuable and interesting text information from web pages. Most current studies targeting this task concern automated web data extraction. In the extraction process, these studies first create a DOM tree and then access the necessary data through this tree. Constructing this tree adds a time cost that depends on the structure of the DOM tree, yet time efficiency is largely ignored in the current web scraping literature. This study proposes a novel approach, namely UzunExt, which extracts content quickly using string methods and additional information, without creating a DOM tree. The string methods consist of the following consecutive steps: searching for a given pattern, calculating the number of closing HTML elements for this pattern, and extracting content for the pattern. During crawling, the approach collects additional information, including the starting position for enhancing the search process, the number of inner tags for improving the extraction process, and tag repetition for terminating the extraction process. The string methods of this approach are about 60 times faster than extraction with the DOM-based method. Moreover, using this additional information improves extraction time by a factor of 2.35 compared to using the string methods alone. Furthermore, the approach can easily be adapted to other DOM-based studies/parsers for this task to enhance their time efficiency. © 2013 IEEE
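
    The three string-method steps named above lend themselves to a compact illustration. Below is a minimal sketch, not the authors' released code: search for a pattern by plain string methods, count nested tags of the same name, and extract the element without building a DOM tree; the function name and sample markup are hypothetical.

```python
# Sketch of string-based extraction: search for a pattern, track the
# nesting depth of same-name tags, and slice out the matching element.
def extract_element(html, pattern, tag, start=0):
    """Extract the element whose opening tag starts with `pattern`.

    `pattern` is e.g. '<div class="content"', `tag` is 'div'.
    Returns the element's full markup, or None if not found.
    """
    begin = html.find(pattern, start)          # step 1: search for the pattern
    if begin == -1:
        return None
    # Note: '<div' also matches '<divx'; a real implementation would
    # check the following character. Kept simple for illustration.
    open_tag, close_tag = "<" + tag, "</" + tag + ">"
    depth, pos = 1, html.find(">", begin) + 1
    while depth:                               # step 2: count inner open/close tags
        next_open = html.find(open_tag, pos)
        next_close = html.find(close_tag, pos)
        if next_close == -1:                   # malformed HTML: no matching close
            return None
        if next_open != -1 and next_open < next_close:
            depth += 1
            pos = next_open + len(open_tag)
        else:
            depth -= 1
            pos = next_close + len(close_tag)
    return html[begin:pos]                     # step 3: extract the content

page = '<html><div class="content">a<div>b</div>c</div></html>'
print(extract_element(page, '<div class="content"', "div"))
# -> <div class="content">a<div>b</div>c</div>
```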

    A regular expression generator based on CSS selectors for efficient extraction from HTML pages

    Get PDF
    Cascading style sheets (CSS) selectors are patterns used to select HTML elements. They are often preferred in web data extraction because they are easy to prepare and have short expressions. To extract data from web pages with these patterns, a document object model (DOM) tree is constructed for the page by an HTML parser. Constructing this tree and extracting from it increase time and memory costs depending on the number of HTML elements and their hierarchies. Regular expressions can reduce these costs, but preparing regular expression patterns is a laborious task. In this study, a heuristic approach, namely Regex Generator (REGEXN), that automatically generates these patterns from CSS selectors is introduced, and the performance gains are analyzed on a web crawler. The analysis shows that the regular expression patterns generated by this approach can significantly reduce the average extraction time, from 743.31 ms to 1.03 ms, when compared with extraction from a DOM tree. Similarly, the average memory usage drops from 1054.01 B to 1.59 B. Moreover, REGEXN can be easily adapted to the existing frameworks and tools for this task. © TÜBİTAK
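
    To make the idea concrete, here is a minimal, hypothetical sketch of turning a simple `tag.class` CSS selector into a regular expression that captures the element's inner HTML straight from the raw page; the paper's actual generation rules are not reproduced here.

```python
# Translate a 'tag.class' selector into a regex over the raw HTML,
# so extraction skips DOM construction entirely.
import re

def css_to_regex(selector):
    tag, cls = selector.split(".", 1)
    # Match <tag ... class="... cls ..." ...> and capture up to </tag>.
    # Note: the non-greedy capture stops at the first </tag>, so nested
    # same-name tags need extra handling.
    return re.compile(
        r'<{tag}\b[^>]*\bclass="[^"]*\b{cls}\b[^"]*"[^>]*>(.*?)</{tag}>'.format(
            tag=re.escape(tag), cls=re.escape(cls)
        ),
        re.DOTALL,
    )

pattern = css_to_regex("div.content")
html = '<div id="x" class="main content">Hello</div>'
print(pattern.search(html).group(1))  # -> Hello
```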

    Examining the Effects of HTML, XML and Web Services on Internet Servers

    Get PDF
    Since the emergence of the WWW (World Wide Web), the HTML (HyperText Markup Language) markup language, which encodes visual and non-visual information, has formed the foundation of the Internet. Because HTML is insufficient for representing data, the XML (Extensible Markup Language) markup language began to take its place in the Internet domain, and with XML the concept of Web services emerged. The aim of this study is to examine the effect of HTML, XML and Web services on Internet servers. In examining this effect, we make use of internet-based dictionary applications used in linguistic work, together with a similar dictionary application that we developed.

    Examination of Extraction Rules in Web Data Extraction

    Get PDF
    Extracting the desired data from a web page is an important issue for applications in the fields of data mining and information retrieval. DOM-based methods or regular expressions can be used to extract data from a web page, and for both techniques multiple extraction rules can be prepared. In this study, the effect on the extraction process of obtaining repetitive data through extraction rules is investigated. As a dataset, fifteen websites in the fields of news, films, and shopping were selected, and extraction rule files were created for extracting data from them with different extraction techniques. The focus is mainly on repetitive data on these websites, such as reviews. Experiments have shown that regular expressions, although more laborious and time-consuming to prepare, give much better results than DOM-based methods. Among the DOM-based methods, the lxml parser library gave the best results, as expected. The experiments indicate that the extraction rules prepared by a developer affect the extraction time. In conclusion, with well-prepared extraction regular expressions, it is possible to access the desired data on web pages much faster.
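
    A minimal sketch of the kind of comparison the study describes: extracting the same repetitive data (e.g. reviews) with a DOM-based parser (lxml) and with a precompiled regular expression, and timing both. The page and rule below are illustrative placeholders; the paper's datasets and rule files are not reproduced.

```python
# Time DOM-based extraction (lxml) against a precompiled regex on the
# same repetitive data.
import re
import timeit
import lxml.html

html = '<html>' + '<div class="review">Nice product</div>' * 100 + '</html>'

def dom_extract():
    tree = lxml.html.fromstring(html)               # builds a DOM tree
    return [e.text for e in tree.xpath('//div[@class="review"]')]

review_re = re.compile(r'<div class="review">(.*?)</div>')

def regex_extract():
    return review_re.findall(html)                  # no tree construction

assert dom_extract() == regex_extract()
print("DOM:  ", timeit.timeit(dom_extract, number=1000))
print("regex:", timeit.timeit(regex_extract, number=1000))
```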

    A Plagiarism Detection Application for Programming Assignments with an Intelligent Lesson Management System

    Get PDF
    Assignments are one of the most important elements of education. Today, assignments can be given over the internet, and submissions can be collected the same way. When a submitted assignment is assessed, it is usually only checked for correctness. Unfortunately, details such as which students submitted similar assignments, how similar the assignments are, and which parts of an assignment are most similar make assessment difficult. The aim of this project is to develop a machine learning system that contributes to the assignment assessment process by using hierarchical clustering methods. In this project, we developed a web-based application, namely the “Intelligent Lesson Management System” (ADYS). In this application, a lecturer manages lecture notes, practical notes, assignments, and grades over the web. Likewise, a student can view these notes and submit his/her own assignment to the lecturer through the system, as a compressed file (zip or rar), a text document (txt, cs, aspx, java, php, etc.), or a rich-text document (such as doc, xls, ppt, or pdf). After receiving the assignments, the lecturer reviews them and grades the student through the application. A detailed examination of the assignments submitted so far shows that many students submit very similar work. However, grades are given according to whether the answers are correct, not how original they are, so students who submit original work receive the same grade as students who copied from each other. To prevent this situation, this project provides a system with which the lecturer can easily check the originality of submissions. We therefore believe that assignments given through this application will encourage students to prepare original work.
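
    As a rough illustration of the clustering idea, the hypothetical sketch below (not the ADYS code) computes pairwise similarity between submitted sources and groups suspiciously similar ones with agglomerative (hierarchical) clustering; the submissions and the 0.3 threshold are made up.

```python
# Group near-duplicate submissions by hierarchical clustering over
# pairwise text distances.
from difflib import SequenceMatcher
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

submissions = {
    "alice": "for i in range(10): print(i * i)",
    "bob":   "for j in range(10): print(j * j)",   # near-copy of alice
    "carol": "total = sum(x ** 2 for x in range(10))",
}
names = list(submissions)
n = len(names)

# Distance = 1 - similarity ratio between each pair of source texts.
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        sim = SequenceMatcher(None, submissions[names[i]], submissions[names[j]]).ratio()
        dist[i, j] = dist[j, i] = 1.0 - sim

# Pairs closer than the (hypothetical) 0.3 distance threshold land together.
clusters = fcluster(linkage(squareform(dist), method="average"),
                    t=0.3, criterion="distance")
for name, cluster_id in zip(names, clusters):
    print(name, "-> cluster", cluster_id)   # alice and bob share a cluster
```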

    An internet-based automatic learning system supported by information retrieval

    No full text
    Doctorate Thesis, Trakya University, Graduate School of Natural and Applied Sciences, Department of Computer Engineering. This thesis presents a web-based system that is intended to perform the task of automatic acquisition of subcategorization frames for Turkish. As a pro-drop, referentially sparse, and free word order language, Turkish provides an interesting and challenging domain of application for natural language processing tasks. The thesis aims to contribute to the fields of information retrieval, natural language processing, and machine learning in the following respects. Firstly, we offer a web-based approach to the automatic construction of corpora to be used in natural language processing and machine learning work. To this effect, we implemented a tool that collects grammatical Turkish sentences from the internet via search engines and annotates them with respect to case marking information. Secondly, various machine learning methods were applied to the generated corpus in order to acquire the subcategorization frames of a set of randomly chosen Turkish verbs. Thirdly, we divided our set of patterns into several subsets of different sizes to understand the effect of data size on the performance of the methods. Lastly, we offer a comparative evaluation of the methods used in our experiments, focusing particularly on the distinction between supervised and unsupervised methods. The thesis is organized as follows. The first chapter gives a brief account of the concepts of information retrieval, subcategorization frames, and machine learning. Moreover, this chapter touches upon the relevant literature and the peculiarities of Turkish as a language to be investigated from a computational point of view. The second chapter introduces some machine learning algorithms and techniques used in our experiments. In the third chapter, we describe the view of the web as a corpus, the largest dataset available for natural language studies. In the fourth chapter, the design and implementation aspects of the proposed system are given. The fifth chapter reports the results of our experiments and provides a comparative evaluation of the methods used, along with observations on the effect of data size on performance. The thesis ends with a summary of major findings and conclusions in chapter six. Keywords: Automatic acquisition of subcategorization frames, machine learning methods, web as a corpus

    A fuzzy ranking approach for improving search results in Turkish as an agglutinative language

    No full text
    This study proposes a fuzzy ranking approach, designed for Turkish as an agglutinative language, that improves on stemming techniques by using the distances of characters in its search algorithm. Various studies on search engines use stemming techniques in the indexing process because of the higher relevancy these techniques provide. However, stemming may have negative effects on the results of some queries. While analyzing the search results to find which query terms give irrelevant results and why, we observed that the suffixes of a user's query terms are crucial to search performance. Therefore, the proposed fuzzy ranking approach supplements traditional stemming approaches with the use of suffixes. The search results of this approach are significantly better than those of stemming techniques on queries where stemming is ineffective. In terms of overall results, the fuzzy ranking approach also gives satisfactory results when compared with stemming techniques such as a Turkish stemmer (19.43% improvement) and word truncation (12.61% improvement). Moreover, it is statistically better than no stemming, with 28.61% improvement. (C) 2011 Elsevier Ltd. All rights reserved
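
    To give a feel for suffix-aware fuzzy matching in an agglutinative language, here is a rough, hypothetical sketch: a word that extends the query stem with extra suffix characters is penalized less than an unrelated word. The scoring formula below is an illustration only; the paper's actual character-distance computation is not reproduced here.

```python
# Score candidates against a query stem, penalizing suffix tails lightly.
def fuzzy_score(query, word):
    # Longest common prefix between the query stem and the candidate word.
    prefix = 0
    for a, b in zip(query, word):
        if a != b:
            break
        prefix += 1
    if prefix == 0:
        return 0.0
    # Unmatched tail characters (likely inflectional suffixes) cost a
    # reduced penalty rather than a full mismatch.
    tail = (len(word) - prefix) + (len(query) - prefix)
    return prefix / (prefix + 0.5 * tail)

# 'kitaplar' and 'kitaplarda' are inflections of 'kitap' (book) and
# score high; the unrelated 'kalem' (pen) scores low.
for word in ["kitap", "kitaplar", "kitaplarda", "kalem"]:
    print(word, round(fuzzy_score("kitap", word), 3))
```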

    An efficient regular expression inference approach for relevant image extraction

    No full text
    Traditional approaches for automatically extracting relevant images from web pages are error-prone and time-consuming. To improve this task, web data extraction approaches rely on operations such as preparing a larger dataset and finding new features; however, these operations are difficult and laborious. In this study, we propose a fully automated approach based on the alignment of regular expressions to extract the relevant images from web pages. Automatically constructed regular expressions have been applied to a classification task for the first time. To this end, a multi-stage inference approach is developed for generating regular expressions from the attribute values of relevant and irrelevant image elements in web pages. The proposed approach reduces the complexity of aligning two regular expressions by applying a constraint to a version of the Levenshtein distance algorithm. The classification accuracy of the regular expression approaches is compared with the naive Bayes, logistic regression, J48, and multilayer perceptron classifiers on a balanced relevant-image retrieval dataset consisting of 360 image element samples from 10 shopping websites. According to the cross-validation results, the regular expression inference-based classification achieved a 0.98 f-measure with only 5 frequent n-grams and outperformed the other classifiers on the same set of features. The classification time of the proposed approach is measured at 0.108 ms, which is very competitive with the other classifiers. © 202
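
    One ingredient named in the abstract, a constrained Levenshtein distance, can be sketched briefly: restricting the dynamic program to a band around the diagonal makes aligning two near-identical patterns cheap. This is a generic band-constrained Levenshtein, assumed for illustration; how the paper integrates the constraint into regex alignment is not shown.

```python
# Band-constrained Levenshtein distance: only cells within `band` of the
# diagonal are filled, reducing work from O(n*m) to O(n*band).
def banded_levenshtein(a, b, band):
    INF = len(a) + len(b) + 1
    prev = [j if j <= band else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [INF] * (len(b) + 1)
        if i <= band:
            cur[0] = i
        lo, hi = max(1, i - band), min(len(b), i + band)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # match / substitution
        prev = cur
    return prev[len(b)]

# Two similar attribute-value patterns differ in only two characters.
print(banded_levenshtein(r'src="\d+\.jpg"', r'src="\d+\.png"', band=3))  # -> 2
```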

    Author Identification Using Different Term Weighting Methods

    No full text
    21st Signal Processing and Communications Applications Conference (SIU) -- APR 24-26, 2013 -- CYPRUS. In this study, the impact of term weighting on author identification, a type of text classification, is investigated. The feature vector used to represent texts consists of stem words as features and their weight values, obtained by applying 14 different term weighting schemes. The performance of these feature vectors on author identification is tested on 3 different datasets with classification methods such as Multinomial Naive Bayes (NBM), Support Vector Machine (SVM), Decision Tree (C4.5), and Random Forest (RF), and the methods are compared with each other. As a result, the most successful classifier for predicting the author of an article is the SVM classifier, with 98.75% mean accuracy, and the most successful term weighting scheme is ACTF.IDF.(ICF+1), with 91.54% overall mean accuracy
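
    A minimal, generic sketch of this kind of pipeline (term weighting feeding an SVM), with plain TF-IDF standing in for the 14 schemes compared in the paper; the texts and author labels below are placeholders.

```python
# Term-weighted features + SVM for author identification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "the ship sailed at dawn toward the quiet harbor",
    "markets rallied as the central bank cut interest rates",
    "the old ship creaked in the storm near the harbor",
    "investors weighed inflation data and bank earnings",
]
authors = ["author_a", "author_b", "author_a", "author_b"]

# TF-IDF weighting feeds a linear SVM, mirroring the paper's best setup.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, authors)
print(model.predict(["the harbor was quiet as the ship returned"]))
# -> ['author_a']
```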

    Performance Evaluation of Classification Methods in Layout Prediction of Web Pages

    No full text
    International Conference on Artificial Intelligence and Data Processing (IDAP) -- SEP 28-30, 2018 -- Inonu Univ, Malatya, TURKEY. The Web is an invaluable source of data stored on web pages. These data are contained in the HTML layout elements of a web page, and extracting them automatically is a crucial issue. In this study, a dataset annotated with seven different layouts, including main content, headline, summary, other necessary layouts, menu, link, and other unnecessary layouts, is used. Then, 49 different features are computed from these layouts. Finally, we compare different classification methods to evaluate their performance in layout prediction. The experiments show that the Random Forest classifier achieves a high accuracy of 98.46%. With this classifier, the prediction of the link layout performs best (approximately 0.988 F-measure) relative to the other layouts, while the prediction of the summary layout performs worst, at about 0.882 F-measure. The authors acknowledge the support received from the Namik Kemal University Research Fund
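
    A minimal sketch of the classification setup described above, with random placeholder values standing in for the paper's 49 computed layout features and its annotated labels.

```python
# Random Forest over 49 layout features, evaluated by cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
LABELS = ["main content", "headline", "summary", "other necessary",
          "menu", "link", "other unnecessary"]
X = rng.random((700, 49))                # 700 layout samples x 49 features
y = rng.choice(len(LABELS), size=700)    # placeholder layout labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
# On real annotated features the paper reports 98.46% accuracy; on this
# random placeholder data the score hovers near chance (~1/7).
print(cross_val_score(clf, X, y, cv=5).mean())
```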